The invention relates to speech processing, such as speech recognition or
speech coding, of a degraded speech signal.

Automatic speech recognition and coding systems are increasingly used.
Although the performance of such systems is continuously improving, it is desired that the
accuracy be increased further, particularly in adverse environments, such as those having a low
signal-to-noise ratio (SNR) or a low-bandwidth signal. Normally, speech recognition systems
compare a representation Y, such as an observation vector with LPC or cepstral components,
of an input speech signal against a model Λ_x of reference signals, such as hidden Markov
models (HMMs), built from representations X, such as reference vectors, of a training speech
signal.

In practice a mismatch exists between the conditions under which the reference 
signals (and thus the models) were obtained and the input signal conditions. Such a mismatch 
may, in particular, exist in the SNR and/or the bandwidth of the signal. The reference signals 
are usually relatively clean (high SNR, high bandwidth), whereas the input signal during 
actual use is distorted (lower SNR, and/or lower bandwidth). 



US 5,727,124 describes a stochastic approach for reducing the mismatch
between the input signal and the reference model. The known method uses a
maximum-likelihood (ML) approach to reduce the mismatch between the input signal
(observed utterance) and the original speech models during recognition of the utterance. The
mismatch may be reduced in the following two ways:

• A representation Y of the distorted input signal can be mapped to an estimate of an original
representation X, so that the original models Λ_x, which were derived from the original
signal representations X, can be used for recognition. This mapping operates in the feature
space and can be described as F_ν(Y), where ν are parameters to be estimated.

• The original models Λ_x can be mapped to transformed models Λ_y which better match the
observed utterance Y. This mapping operates in the model space and can be described as
G_η(Λ_x), where η represents parameters to be estimated.
01.07.1999
The parameters ν and/or η are estimated using the expectation-maximization algorithm to
iteratively improve the likelihood of the observed speech Y given the models Λ_x. The
stochastic matching algorithm operates only on the given test utterance and the given set of
speech models. No training is required for the estimation of the mismatch prior to the actual
testing. The mappings described in US 5,727,124 are hereby included by reference.
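
As a hedged illustration of the feature-space mapping, the sketch below assumes the simplest parametric form, an additive bias F_ν(Y) = Y − b, and a one-dimensional Gaussian mixture standing in for the models Λ_x. Both choices, and all function and variable names, are assumptions made for illustration; they are not the specific mappings of US 5,727,124.

```python
import math
import random


def estimate_bias(observations, means, variances, weights, iterations=10):
    """EM estimate of an additive bias b such that y - b fits a 1-D Gaussian mixture."""
    bias = 0.0
    for _ in range(iterations):
        numerator = 0.0
        denominator = 0.0
        for y in observations:
            # E-step: log posterior (up to a constant) of each component
            # for the bias-compensated sample y - bias.
            log_post = [
                math.log(w) - 0.5 * math.log(2 * math.pi * var)
                - 0.5 * (y - bias - mu) ** 2 / var
                for mu, var, w in zip(means, variances, weights)
            ]
            peak = max(log_post)
            post = [math.exp(lp - peak) for lp in log_post]
            total = sum(post)
            # M-step accumulators: posterior- and variance-weighted residuals.
            for gamma, mu, var in zip(post, means, variances):
                gamma /= total
                numerator += gamma * (y - mu) / var
                denominator += gamma / var
        bias = numerator / denominator
    return bias


# Toy data: "clean" samples drawn from the model, then shifted by a bias of 1.5.
random.seed(0)
means, variances, weights = [-2.0, 2.0], [0.25, 0.25], [0.5, 0.5]
clean = [random.gauss(random.choice(means), 0.5) for _ in range(200)]
distorted = [x + 1.5 for x in clean]

bias = estimate_bias(distorted, means, variances, weights)
restored = [y - bias for y in distorted]  # estimate of the original representation X
```

Each pass re-estimates the bias so that the likelihood of the compensated observations given the model does not decrease, which is the iterative improvement described above.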

Both methods may also be combined, where the representation Y of the
distorted input signal is mapped to an estimate of an original representation X and the original
models Λ_x are mapped to transformed models which better match the estimated representation
X. The methods may be used in an iterative manner where the transformed signal and/or the
transformed models replace the respective original input signal and/or models. In this way the
input signal and models are iteratively transformed to obtain a statistically closer match between
the input signal and the models. In this process a relatively noisy input signal may get
transformed to a cleaner input signal, whereas relatively clean models might get transformed
to more noisy models.

For recognition, models are usually trained under the best (clean) conditions in
order to obtain optimal recognition. In the known method, the models are transformed based
on the distorted input signal. This degrades the performance, particularly for low SNRs,
making it difficult to obtain the optimal performance which could be achieved with the
original models. Moreover, if the mismatch between the original models and the input signal is
significant, the risk of transforming the signal and/or models in a wrong direction increases
(even though they may come statistically closer). This is for instance the case if the input signal
has a low signal-to-noise ratio, making it difficult to reliably estimate the original signal.



It is an object of the invention to provide a speech processing method and
speech processing system capable of improved speech processing, particularly under adverse
conditions.

To achieve the object of the invention, the method of processing a degraded
speech input signal includes:
- receiving the degraded speech input signal;
- estimating a condition, such as the signal-to-noise ratio or bandwidth, of the
received input signal;
- selecting a processing model corresponding to the estimated signal condition;
- estimating an originally uttered speech signal based on the received input
signal;
- processing the estimated original signal according to the selected model; and
- outputting a processing result.

In the method according to the invention, starting from an initial estimate of a
condition of the signal (e.g. SNR or bandwidth), a processing model is selected, where the new
model is a function of the estimated signal condition. Preferably, a model is selected which
was optimally trained for the signal condition. Also an estimate is made of the originally
uttered speech. By both selecting an appropriate model and estimating the original speech, the
processing accuracy improves in a "push-pull" manner. In the known system, the current
model is transformed to a new one, where the transformation is a function of the input signal Y
(Λ_y = G_η(Λ_x)). In the method according to the invention, no model transformation takes place,
avoiding degradation of the model. Instead a model matching the estimated signal condition is
used.

As described in the dependent claim 2, the estimate of the originally uttered
speech is based on a predetermined processing model Λ_x. Preferably, the estimate is based on a
Maximum Likelihood Estimation (MLE). For instance, the MLE approach of US 5,727,124
may be used, wherein the estimated original speech X̂ is given by:

$$\hat{X} = F_{\hat{\nu}}(Y),$$

where the parameters ν̂ are given by:

$$\hat{\nu} = \arg\max_{\nu} P\{Y, \nu \mid S, \Lambda_x\}.$$

As described in the dependent claim 3, the processing model used for
estimating the original speech is the model Λ_x(ξ) selected to match the estimated signal
condition ξ. In this way the accuracy of estimating the original signal is increased.

As described in the dependent claim 4, an iterative procedure is used, wherein
in each iteration the signal condition is re-estimated, a new model is selected based on the
new signal condition, and a new estimate is made of the original speech (using the then
selected model). The model which was selected first acts as a discrimination seed for the
further bootstrap operations. The iteration stops when a criterion is met (e.g. the recognition
with the then selected model is adequate, or its likelihood no longer improves (e.g. gets worse)
compared to a likelihood obtained by a previous recognition). The iteration process may start with a
conservative estimate of the degradation of the signal (e.g. a relatively high SNR), where in
each iteration the signal condition is degraded (e.g. a lower SNR is selected).

To meet the object of the invention, the speech processing system for
processing a degraded speech input signal includes:
- an input for receiving the degraded speech input signal;
- means for estimating a condition, such as the signal-to-noise ratio or
bandwidth, of the received input signal;
- means for selecting a processing model corresponding to the estimated signal
condition;
- means for estimating an originally uttered speech signal based on the received
input signal;
- means for processing the estimated original signal according to the selected
model; and
- an output for outputting a processing result.



These and other aspects of the invention will be apparent from and elucidated
with reference to the embodiments shown in the drawings.

Fig. 1 shows a block diagram of a conventional speech processing system
wherein the invention can be used;
Fig. 2 illustrates conventional word models used in speech processing;
Fig. 3 illustrates an iterative embodiment of the method according to the
invention;
Fig. 4 shows a block diagram of a speech processing system according to the
invention;
Fig. 5 shows a block diagram of a speech processing system wherein the
method according to the invention is exploited twice, to overcome SNR and bandwidth
degradation; and
Figs. 6, 7, and 8 illustrate results obtained with the method and system
according to the invention.

General description of a speech recognition system.

Speech recognition systems, such as large vocabulary continuous speech
recognition systems, typically use a collection of recognition models to recognize an input
pattern. For instance, an acoustic model and a vocabulary may be used to recognize words and
a language model may be used to improve the basic recognition result. Figure 1 illustrates a
typical structure of a large vocabulary continuous speech recognition system 100 [refer to
L. Rabiner, B-H. Juang, "Fundamentals of speech recognition", Prentice Hall 1993, pages 434
to 454]. The following definitions are used for describing the system and recognition method:

Λ_x: a set of trained speech models
X: the original speech which matches the model Λ_x
Y: the testing speech
Λ_y: the matched models for the testing environment
W: the word sequence
S: the decoded sequences, which can be words, syllables, sub-word units, states or
mixture components, or other suitable representations.

The system 100 comprises a spectral analysis subsystem 110 and a unit
matching subsystem 120. In the spectral analysis subsystem 110 the speech input signal (SIS)
is spectrally and/or temporally analyzed to calculate a representative vector of features
(observation vector, OV). Typically, the speech signal is digitized (e.g. sampled at a rate of
6.67 kHz) and pre-processed, for instance by applying pre-emphasis. Consecutive samples are
grouped (blocked) into frames, corresponding to, for instance, 32 msec of speech signal.
Successive frames partially overlap, for instance by 16 msec. Often the Linear Predictive Coding
(LPC) spectral analysis method is used to calculate for each frame a representative vector of
features (observation vector). The feature vector may, for instance, have 24, 32 or 63
components. The standard approach to large vocabulary continuous speech recognition is to
assume a probabilistic model of speech production, whereby a specified word sequence
W = w_1 w_2 w_3 ... w_q produces a sequence of acoustic observation vectors Y = y_1 y_2 y_3 ... y_T. The
recognition error can be statistically minimized by determining the sequence of words
w_1 w_2 w_3 ... w_q which most probably caused the observed sequence of observation vectors
y_1 y_2 y_3 ... y_T (over time t = 1, ..., T), where the observation vectors are the outcome of the spectral
analysis subsystem 110. This results in determining the maximum a posteriori probability:

$$\max_{W} P(W \mid Y, \Lambda_x), \quad \text{for all possible word sequences } W.$$

By applying Bayes' theorem on conditional probabilities, P(W | Y, Λ_x) is given by:

$$P(W \mid Y, \Lambda_x) = \frac{P(Y \mid W, \Lambda_x)\, P(W)}{P(Y)}.$$

Since P(Y) is independent of W, the most probable word sequence is given by:

$$\hat{W} = \arg\max_{W} P(Y, W \mid \Lambda_x) = \arg\max_{W} P(Y \mid W, \Lambda_x)\, P(W). \qquad (1)$$
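
The decision rule of equation (1) can be sketched in a few lines, working in the log domain so that the product becomes a sum. The candidate word sequences and their scores below are invented purely for illustration, and the function name is hypothetical; a real recognizer searches a very large hypothesis space rather than a two-entry table.

```python
import math


def most_probable_sequence(acoustic_log_likelihoods, language_probabilities):
    """Return argmax over W of log P(Y | W, model) + log P(W)."""
    def score(words):
        return acoustic_log_likelihoods[words] + math.log(language_probabilities[words])
    return max(acoustic_log_likelihoods, key=score)


# Hypothetical scores: log P(Y | W) from an acoustic model, P(W) from a language model.
acoustic = {"recognize speech": -120.0, "wreck a nice beach": -119.5}
language = {"recognize speech": 0.6, "wreck a nice beach": 0.01}

best = most_probable_sequence(acoustic, language)
```

In this toy case the acoustically slightly better hypothesis loses because its prior probability P(W) is much lower.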
In the unit matching subsystem 120, an acoustic model provides the first term
of equation (1). The acoustic model is used to estimate the probability P(Y|W) of a sequence
of observation vectors Y for a given word string W. For a large vocabulary system, this is
usually performed by matching the observation vectors against an inventory of speech
recognition units. A speech recognition unit is represented by a sequence of acoustic
references. Various forms of speech recognition units may be used. As an example, a whole
word or even a group of words may be represented by one speech recognition unit. A word
model (WM) provides for each word of a given vocabulary a transcription in a sequence of
acoustic references. In most small vocabulary speech recognition systems, a whole word is
represented by a speech recognition unit, in which case a direct relationship exists between the
word model and the speech recognition unit. In other small vocabulary systems, for instance
those used for recognizing a relatively large number of words (e.g. several hundreds), or in large
vocabulary systems, use can be made of linguistically based sub-word units, such as phones,
diphones or syllables, as well as derivative units, such as fenenes and fenones. For such
systems, a word model is given by a lexicon 134, describing the sequence of sub-word units
relating to a word of the vocabulary, and the sub-word models 132, describing sequences of
acoustic references of the involved speech recognition unit. A word model composer 136
composes the word model based on the sub-word models 132 and the lexicon 134.

Figure 2A illustrates a word model 200 for a system based on whole-word
speech recognition units, where the speech recognition unit of the shown word is modeled
using a sequence of ten acoustic references (201 to 210). Figure 2B illustrates a word model
220 for a system based on sub-word units, where the shown word is modeled by a sequence of
three sub-word models (250, 260 and 270), each with a sequence of four acoustic references
(251, 252, 253, 254; 261 to 264; 271 to 274). The word models shown in Fig. 2 are based on
Hidden Markov Models (HMMs), which are widely used to stochastically model speech
signals. Using this model, each recognition unit (word model or sub-word model) is typically
characterized by an HMM, whose parameters are estimated from a training set of data. For
large vocabulary speech recognition systems usually a limited set of, for instance 40, sub-word
units is used, since it would require a lot of training data to adequately train an HMM for
larger units. An HMM state corresponds to an acoustic reference. Various techniques are
known for modeling a reference, including discrete or continuous probability densities. Each
sequence of acoustic references which relates to one specific utterance is also referred to as an
acoustic transcription of the utterance. It will be appreciated that if recognition
techniques other than HMMs are used, details of the acoustic transcription will be different.
A word level matching system 130 of Fig. 1 matches the observation vectors
against all sequences of speech recognition units and provides the likelihoods of a match
between the vector and a sequence. If sub-word units are used, constraints can be placed on
the matching by using the lexicon 134 to limit the possible sequences of sub-word units to
sequences in the lexicon 134. This reduces the outcome to possible sequences of words.

Furthermore a sentence level matching system 140 may be used which, based
on a language model (LM), places further constraints on the matching so that the paths
investigated are those corresponding to word sequences which are proper sequences as
specified by the language model. As such the language model provides the second term P(W)
of equation (1). Combining the results of the acoustic model with those of the language model
results in an outcome of the unit matching subsystem 120 which is a recognized sentence (RS)
152. The language model used in pattern recognition may include syntactical and/or
semantical constraints 142 of the language and the recognition task. A language model based
on syntactical constraints is usually referred to as a grammar 144. The grammar 144 used by
the language model provides the probability of a word sequence W = w_1 w_2 w_3 ... w_q, which in
principle is given by:

$$P(W) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2) \cdots P(w_q \mid w_1 w_2 w_3 \ldots w_{q-1}).$$

Since in practice it is infeasible to reliably estimate the conditional word probabilities for all
words and all sequence lengths in a given language, N-gram word models are widely used. In
an N-gram model, the term P(w_j | w_1 w_2 w_3 ... w_{j-1}) is approximated by P(w_j | w_{j-N+1} ... w_{j-1}). In
practice, bigrams or trigrams are used. In a trigram, the term P(w_j | w_1 w_2 w_3 ... w_{j-1}) is
approximated by P(w_j | w_{j-2} w_{j-1}).
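
A minimal sketch of the trigram approximation follows, assuming plain relative-frequency estimates from a tiny invented corpus. The corpus, the function names and the choice to return 0.0 for unseen histories are illustrative assumptions; practical language models add smoothing or back-off, which is omitted here.

```python
from collections import Counter


def train_trigrams(sentences):
    """Count trigrams and their two-word histories in whitespace-tokenised sentences."""
    trigram_counts, history_counts = Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        for first, second, third in zip(words, words[1:], words[2:]):
            trigram_counts[(first, second, third)] += 1
            history_counts[(first, second)] += 1
    return trigram_counts, history_counts


def trigram_probability(word, history, trigram_counts, history_counts):
    """Relative-frequency estimate of P(w_j | w_{j-2} w_{j-1}); 0.0 if the history is unseen."""
    seen = history_counts[history]
    if seen == 0:
        return 0.0
    return trigram_counts[history + (word,)] / seen


corpus = ["call john smith", "call john doe", "call mary smith"]
trigrams, histories = train_trigrams(corpus)
p_smith = trigram_probability("smith", ("call", "john"), trigrams, histories)
```

Here the history "call john" occurs twice and is followed by "smith" once, so the estimate is one half.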
The speech processing system according to the invention may be implemented
using conventional hardware. For instance, a speech recognition system may be implemented
on a computer, such as a PC, where the speech input is received via a microphone and
digitized by a conventional audio interface card. All additional processing takes place in the
form of software procedures executed by the CPU. In particular, the speech may be received
via a telephone connection, e.g. using a conventional modem in the computer. The speech
processing may also be performed using dedicated hardware, e.g. built around a DSP.
Detailed description of the invention:

According to the invention, a matching algorithm is used to overcome the
matched performances for robust speech recognition. Preferably, the algorithm is used in an
iterative manner, and the matching is based on a stochastic matching: the Successive
Stochastic Matching (SSM) algorithm. The algorithm may in principle be used to deal with
any degraded signal condition. In particular, two parametric forms are described. The first one
is called "SNR-incremental stochastic matching (SISM)" for noisy speech recognition, where
SNR denotes signal-to-noise ratio; the second one is called "bandwidth-incremental stochastic
matching (BISM)" to improve the recognition accuracy of narrow-band speech and to
approach the performances of the speech models trained from high quality microphone speech.
Both forms of the algorithm may also be combined. The algorithm is specifically suitable for
telephone speech recognition. However, it may also be used, for instance, for speech
recognition where a microphone is directly connected to a processing unit, such as a PC,
although in this case the signal condition is in general better, so that less improvement can be
achieved. In the algorithm according to the invention, a bootstrapped and, preferably,
well-retrained model which has good discrimination characteristics is used to improve the
recognition; this is the bootstrap operation. It is preferably repeated during each iteration. Besides
speech recognition, the algorithm can also be used for speech coding (particularly for transfer
via a telephone system). For this application, bootstrap codebooks/encoders are used instead of
bootstrap models/recognisers, i.e. Λ_x(ξ) denotes the bootstrap codebooks for coding instead of
speech recognition models.

The iterative version of the algorithm is as follows and as illustrated in Figure 3:

Initialization:

Step 300: Initialise parameters:
    $l = 0$, where $l$ denotes the iteration number,
    $\nu^{(l)} = \nu_0$, where $\nu$ is the parameter set of the inverse function $F_\nu$, and
    $\hat{X}^{(l)} = Y$, where $Y$ is the received input speech (the testing speech), and $\hat{X}$ is an
    estimate of the originally uttered speech; and
    Estimate an initial signal condition $\xi^{(l)}$ ($\xi$ represents the signal condition, like
    the SNR or bandwidth)

Recursion:

Step 310: Select a matched bootstrap model $\Lambda_x(\xi^{(l)})$, e.g. from a set of stored models 320

Step 330: Recognise the speech: $\hat{S}^{(l)} = \arg\max_{S} P\{\hat{X}^{(l)}, S \mid \nu^{(l)}, \Lambda_x(\xi^{(l)})\}$

Step 340: Check a predetermined stop criterion. If the criterion is met, then STOP and
OUTPUT $\hat{S}$ (350)

Step 360: Estimate $\nu$: $\nu^{(l+1)} = \arg\max_{\nu} P\{\hat{X}^{(l)}, \nu \mid \hat{S}^{(l)}, \Lambda_x(\xi^{(l)})\}$

Step 370: Estimate the original speech: $\hat{X}^{(l+1)} = F_{\nu^{(l+1)}}(\hat{X}^{(l)})$

Step 380: Increase the estimate of the signal condition: $\xi^{(l+1)} = \xi^{(l)} + \delta$, ($\delta > 0$)

Reiterate: $l \leftarrow l + 1$ and go to step 310
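
The control flow of Fig. 3 can be sketched as follows. This is a toy under strong assumptions, not the recognizer of the invention: each stored model is reduced to one Gaussian mean per word with a variance that shrinks as the SNR grows, the inverse function is taken to be a simple bias removal, and the stop criterion is "the likelihood no longer increases". All names and numbers are hypothetical.

```python
import math


def log_likelihood(samples, mean, var):
    """Log-likelihood of the samples under a single Gaussian."""
    return sum(-0.5 * math.log(2 * math.pi * var) - 0.5 * (s - mean) ** 2 / var
               for s in samples)


def select_model(models, snr_db):
    """Step 310: pick the stored model trained at the SNR closest to the estimate."""
    return models[min(models, key=lambda level: abs(level - snr_db))]


def recognise(samples, model):
    """Step 330: return the best-scoring word and its log-likelihood."""
    scores = {word: log_likelihood(samples, mean, model["var"])
              for word, mean in model["means"].items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


def estimate_original(samples, model, word):
    """Steps 360-370 (toy form): remove the ML bias relative to the recognised word."""
    bias = sum(samples) / len(samples) - model["means"][word]
    return [s - bias for s in samples]


def ssm(samples, models, snr_db, delta_db=10.0, max_iterations=10):
    estimate = list(samples)                      # Step 300: X^(0) = Y
    best_word, best_score = None, -math.inf
    for _ in range(max_iterations):
        model = select_model(models, snr_db)      # Step 310
        word, score = recognise(estimate, model)  # Step 330
        if score <= best_score:                   # Step 340: likelihood stopped improving
            break
        best_word, best_score = word, score
        estimate = estimate_original(estimate, model, word)  # Steps 360-370
        snr_db += delta_db                        # Step 380
    return best_word, best_score


# Hypothetical bootstrap models keyed by training SNR in dB.
models = {snr: {"means": {"yes": 1.0, "no": -1.0}, "var": var}
          for snr, var in ((0, 4.0), (10, 1.0), (20, 0.25))}
word, score = ssm([1.6, 1.2, 1.9, 1.4], models, snr_db=0.0)
```

The loop starts from the model for the initially estimated condition and moves to models for better conditions only while the likelihood keeps improving.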

In step 310, a bootstrap model is selected matching the signal condition ξ^(l).
Preferably, the system comprises several models, each optimised for a different signal
condition. The selection then simply involves loading the model associated with the signal
condition ξ^(l). Such a set of models can be created from the same original "clean" speech
recording. For instance, for the SISM algorithm white Gaussian noise may be added to the
clean speech to contaminate the signal to a desired SNR, followed by training a recognition
model from the contaminated speech signals. The model is then stored in association with the
SNR (ξ). This can be done for several SNRs, resulting in a set of retrained models. Of course,
recordings of speech may also be made under various signal conditions, where the models are
then created from the original recordings instead of from contaminated recordings.
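
The contamination step can be sketched as below, assuming the SNR is measured as the ratio of signal variance to noise variance. A sine wave stands in for the clean recording, and the function name, sample count and SNR levels are illustrative assumptions; training a model from each contaminated set is outside the scope of the sketch.

```python
import math
import random
import statistics


def contaminate(clean, snr_db, rng):
    """Add white Gaussian noise so that var(clean) / var(noise) matches snr_db."""
    noise_variance = statistics.pvariance(clean) / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_variance)
    return [sample + rng.gauss(0.0, sigma) for sample in clean]


rng = random.Random(1)
# Stand-in for a clean recording: 4000 samples of a sine wave.
clean = [math.sin(2 * math.pi * 5 * n / 4000) for n in range(4000)]
# One contaminated copy per desired SNR; each would be used to retrain one model.
training_sets = {snr: contaminate(clean, snr, rng) for snr in (0, 10, 20)}
```

Each entry of `training_sets` would then be stored, after training, in association with its SNR.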

In step 340, for speech recognition preferably the stop criterion is based on the
recognition result with the current model. If the recognition result is sufficient (e.g. based on
confidence measures) or the likelihood does not increase anymore, the iteration may be
stopped.

It will be appreciated that in steps 360 and 370, an estimate of the original
speech is based on the inverse function F_ν. In principle, other suitable methods may also be
used to map the current speech signal to an improved estimate, preferably using the currently
selected model Λ_x(ξ).

In a non-iterative version of the algorithm, it is sufficient to perform step 370
only once. This may for instance be achieved by executing the following sequence: steps 300,
310, 360, 370, 380, 310, and 330, followed by outputting the recognition result (step 350).

General Properties:

1. $P\{Y(\xi') \mid \Lambda_x(\xi')\} \ge P\{Y(\xi) \mid \Lambda_x(\xi)\}$ for $\xi' \ge \xi$, where ξ and ξ' denote the signal
condition (e.g. SNR or bandwidth) and Y(ξ) denotes the testing speech at signal condition ξ.
This property implies that the matched performance of, for instance, high SNR or wide
bandwidth is better than the one of low SNR or narrow bandwidth.

2. $P\{Y(\xi') \mid \Lambda_x(\xi')\} \ge P\{Y(\xi') \mid \Lambda_x(\xi)\}$ for $\xi' \ne \xi$, where ξ and ξ' denote SNR only
in this property.

SSM's Properties:

1. According to the above two properties, the local maximum of P will be located at
ξ' ∈ [ξ^(0), ξ^(0) + Θ], Θ > 0. It means that overcoming the matched performances is
possible.

2. The decoded sequence S = {s_i, 1 ≤ i ≤ T} can be expected to be the optimal
solution in each recursive step by automatically selecting the matched bootstrap model.

3. The models Λ_x(ξ), which are well trained in different signal conditions (different
SNRs for SISM or different bandwidths for BISM), are the bootstrap models for gaining the
discrimination.

The initial joint bootstrap operation is a core feature in the SSM algorithm. In
the initial step, a matched model is selected as a discrimination seed for further bootstrap
operations. It is an optimal initialization with the most discrimination power. It means that the
seed gives the least mismatch between model and input signal in the sense of
maximum likelihood estimation. In a bootstrap step, the model is varied as a function of the
signal condition, like SNR or bandwidth, i.e. Λ_x(ξ), and the testing speech is also updated to an
estimate of the original speech (e.g. by the inverse function F_ν). This implies a "push-pull" towards the
recognition performances of higher SNR for SISM or wider bandwidth for BISM. Preferably,
the bootstrap operation is performed in an iterative manner. In this way, the signal condition
can be improved successively (e.g. increasing the SNR or bandwidth) for the mutual
optimisation of features and models.

In the SSM algorithm, in step 300 an initial estimate is made of the signal
condition (SNR for SISM or bandwidth for BISM) in order to select a matched bootstrap
model as discrimination seed. The initial estimate may be based on typical conditions for a
specific application. Also a (simple) test of the signal may be done. The optimal state/mixture
sequence can be obtained via matched bootstrap models in each recursive step. An exemplary
block diagram of a speech recognition system using the SSM algorithm is shown in Figure 4.
In block 410 features are extracted from the received speech signal. This may be done in a
manner described for the spectral analysis subsystem 110 of Fig. 1. In block 420, an estimate
is made of the signal condition. This may be based on measuring/estimating such a condition
in a known way, or may simply be a conservative estimate (only a moderate degradation as
typically exists minimally for the given application). In block 430, the speech is processed in
the normal way (e.g. in a manner described for the unit matching subsystem 120 of Fig. 1),
where according to the invention a bootstrap model matching the estimated signal condition is
obtained from a storage 440 which comprises a set of models for different signal conditions.
As described for Fig. 3, the processing is also changed in that an estimate is made of the
original speech input. Moreover, the iterative procedure of Fig. 3 may be followed.

The BISM can be applied to narrow-band speech recognition using a bandwidth
incremental approach in order to obtain the accuracy of high quality models trained from
microphone speech. It is well known that the recognition performance for telephone speech is worse than
for microphone speech, even in noise-free conditions. The BISM can break through the traditional
limits of telephone speech recognition accuracy. Advantageously, the SISM and BISM
algorithms are combined for noisy narrow-band speech recognition. Figure 5 shows a block
diagram of a speech recognition system using both algorithms. In this embodiment, which is
for instance suitable for recognition of noisy telephone speech, the SISM and BISM
algorithms are cascaded to remove the noise effects using telephone bootstrap models and to
approach the performance of high quality microphone models using microphone bootstrap
models. In block 500 features are extracted from the received speech signal. This may be done
in a manner described for the spectral analysis subsystem 110 of Fig. 1. In block 510, an
estimate is made of two signal conditions. In the example, an estimate is made of the SNR and
of the signal bandwidth. The estimate may be based on measuring/estimating such a condition
in a known way, or may simply be a conservative estimate (only a moderate degradation as
typically exists minimally for the given application). In block 520, the speech is processed in
the normal way (e.g. in a manner described for the unit matching subsystem 120 of Fig. 1),
where according to the invention a bootstrap model matching the estimated signal condition is
obtained from a storage 530 which comprises a set of models for different signal conditions. In
the shown example, the bootstrap models are optimised for different SNRs of the input signal.
As described for Fig. 3, the processing is also changed in that an estimate is made of the
original speech input. Moreover, the iterative procedure of Fig. 3 may be followed. In this way
suitable model(s) for processing at this SNR are located and the input signal is transformed to
an estimated original signal, assuming this SNR. Following this, the same procedure is used in
block 540 for the bandwidth, where the models for the various bandwidths are retrieved from a
storage 550. In the example, it is also possible to integrate the storages 530 and 550. For
instance, for each supported SNR level a set of models may be stored, each having a different
bandwidth. This enables a simple procedure for performing both optimisations. For instance,
assuming a default or estimated bandwidth, first the most appropriate model for the SNR is
determined, preferably in an iterative manner. This results in identifying a set of models for
that SNR, where the models differ in bandwidth. In a next process, the model best
matching the bandwidth is then selected from that set of models. It will be appreciated that instead
of cascading the two processing steps, an integrated procedure can also be made.
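
The integrated storage can be sketched as a table keyed by (SNR, bandwidth), with the two-stage selection performed as nearest-level lookups. The nearest-level rule replaces the iterative SNR search described above purely to keep the sketch short, and the key layout, levels and names are assumptions for illustration.

```python
def select_model(store, snr_db, bandwidth_hz):
    """Two-stage lookup: first the closest SNR level, then the closest bandwidth."""
    snr_levels = {snr for snr, _ in store}
    best_snr = min(snr_levels, key=lambda level: abs(level - snr_db))
    # All models stored for that SNR; they differ only in bandwidth.
    bandwidths = [bw for snr, bw in store if snr == best_snr]
    best_bandwidth = min(bandwidths, key=lambda bw: abs(bw - bandwidth_hz))
    return store[(best_snr, best_bandwidth)]


# Hypothetical combined storage: one model per (SNR in dB, bandwidth in Hz) pair.
store = {(snr, bw): f"model_snr{snr}_bw{bw}"
         for snr in (0, 10, 20) for bw in (4000, 8000)}
chosen = select_model(store, snr_db=12.0, bandwidth_hz=3400.0)
```

With these levels, an estimated SNR of 12 dB and a bandwidth of 3.4 kHz resolve to the 10 dB, 4 kHz entry.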

The SSM algorithm can be applied to robust speech coding by using bootstrap
codebooks/encoders instead of bootstrap models/recognisers, i.e. Λ_x(ξ) denotes the bootstrap
codebooks. The SISM algorithm can improve the quality of microphone or telephone speech
coding to a high SNR level in adverse environments. The BISM algorithm can even
improve telephone speech coding to microphone (or wider bandwidth) quality. It means
that it is possible to transmit coded speech with microphone quality through telephone
networks by using the BISM algorithm for telephone speech coding, because the telephone
speech can be decoded by using microphone codebooks. The implementation of SSM for
speech coding is similar to the one described for recognition, with the bootstrap
models replaced by bootstrap codebooks. The block diagram of Fig. 5 also applies to noisy telephone
speech coding. The output is the codebook entry.
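
For the coding case, the sketch below assumes a plain vector quantiser: a codebook is chosen for the estimated signal condition and the frame is mapped to the index of its nearest entry. The codebook contents, the nearest-SNR selection and the squared Euclidean distance are illustrative assumptions, not details taken from the description.

```python
def encode(frame, codebook):
    """Return the index of the codebook entry closest to the frame (squared Euclidean)."""
    def distance(entry):
        return sum((a - b) ** 2 for a, b in zip(frame, entry))
    return min(range(len(codebook)), key=lambda index: distance(codebook[index]))


# Hypothetical bootstrap codebooks, keyed by SNR in dB.
codebooks = {
    10: [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5)],
    20: [(0.0, 0.0), (0.9, 1.1), (2.1, 0.4)],
}
estimated_snr_db = 18
selected = codebooks[min(codebooks, key=lambda snr: abs(snr - estimated_snr_db))]
index = encode((1.0, 1.2), selected)
```

The resulting index plays the role of the codebook entry that is output by the coder.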



Results:

Experiments were performed to evaluate the principal performance boundaries
of adapted and retrained models under added noise conditions. Adapted models fully alter the
parameters of Hidden Markov Models (HMMs) from clean ones in order to match the noisy test
environment. Retrained models are fully trained from white Gaussian-noise contaminated
speech at matched signal-to-noise ratio (SNR) environments. As described above, such
retrained models can be used in the SISM algorithm. The capabilities and limitations of
adapted models and retrained models have been studied. The results show that the concept of
using retrained models according to the invention provides a better performance than using
adapted models. This holds for any conditions but especially for low SNRs. The results show
that phone error rates for retrained models are about 6% better than for adapted models. It has
also been found that the retrained models improve the word error rate by 6% for 15-dB SNR
and even by 18% for 0-dB SNR. Details are provided below.

The model retraining technique has been compared to the known technique of
model adaptation/transformation. In this known technique, the models are adapted onto the
test environments. The resulting performance depends on the state-to-frame alignment and is
often bounded by the performance in matched conditions. Maximum likelihood linear
regression (MLLR) is used to adapt the models into new environments. Stochastic matching
(SM) modifies features or models in order to match the environmental change.

The principal limitations of algorithms for model adaptation/transformation have
been studied by using fully adapted models as has been described above for US 5,727,124 and
retrained models according to the invention. The fully adapted model is used to simulate that
the added noise can be estimated accurately for model re-estimation. The experimental set-up
of the fully adapted models is as follows:

Step 1: The clean speech of the training corpus is segmented by means of clean models,
and the paths are kept for noisy model training.

Step 2: Different levels of added noise are added into the test utterances. All HMM
parameters are re-estimated without any further iteration.
The retrained models are fully trained from noisy speech at matched SNR
environments like the training of clean models. White Gaussian noise was added to the testing
utterances at different total SNR levels. The total SNR is defined as follows, where $\sigma_x^2$ is the
variance of the testing speech utterance and $\sigma_n^2$ is the variance of the added noise.

$$\mathrm{TotalSNR} = 10 \log_{10} \left( \frac{\sigma_x^2}{\sigma_n^2} \right)$$
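
As an illustration of this definition, white Gaussian noise can be scaled so that the ratio
of the utterance variance to the noise variance matches a chosen total SNR. The sketch below
is an assumption-laden example rather than the procedure used in the experiments: the
function names total_snr_db and add_white_noise are hypothetical, and a sine wave stands in
for a speech utterance.

```python
import numpy as np


def total_snr_db(speech, noise):
    """Total SNR in dB: 10 * log10(var(speech) / var(noise))."""
    return 10.0 * np.log10(np.var(speech) / np.var(noise))


def add_white_noise(speech, snr_db, rng):
    """Return (noisy, noise) with white Gaussian noise at the target total SNR."""
    speech = np.asarray(speech, dtype=float)
    noise = rng.standard_normal(speech.shape)
    # Scale the noise so that var(speech) / var(noise) == 10 ** (snr_db / 10).
    target_noise_var = np.var(speech) / (10.0 ** (snr_db / 10.0))
    noise *= np.sqrt(target_noise_var / np.var(noise))
    return speech + noise, noise


rng = np.random.default_rng(0)
# Stand-in for a one-second utterance sampled at 16 kHz.
speech = np.sin(2 * np.pi * 200 * np.arange(16000) / 16000)
noisy, noise = add_white_noise(speech, snr_db=15.0, rng=rng)
print(total_snr_db(speech, noise))
```

The variance is computed over the whole utterance, in line with the definition of the total
SNR given above.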



Experiments were performed on the "Japanese Electronic Industry
Development Association's Common Speech Data Corpus" (JSDC), being mainly an
isolated-phrase corpus. The JSDC corpus was recorded with dynamic microphones and sampled at 16
kHz. The phonetically rich JSDC city-name subcorpus was used to train phone-based HMMs.
In the experiments 35 monophone HMMs were deployed with three states per model and
nominal 32 Laplacian mixture densities per state. The JSDC control-word corpus with a
vocabulary of 63 words was used as testing material.

Experiments for free-phone decoding and word recognition were performed.
The resulting phone and word error rates are shown in Fig. 6 and Fig. 7, respectively.
Horizontally, the SNR is shown in dB. Vertically the respective error rates are shown (in
percentages). The following curves are shown:

1. Corrupted performance: The models are clean and the test material is corrupted by added
white Gaussian noise, where clean means there is no noise added.
2. Fully adapted performance: The models are adapted from clean ones based on known
noise levels and the test material is corrupted at the same SNR levels.
3. Retrained performance: The models are fully retrained in known SNR environments and
the test material is corrupted at the same SNR levels.
It has been found that retrained models always perform better than adapted
models, under any condition but especially at low SNR levels. Fig. 6 shows that phone error
rates for retrained models are about 6% better than for adapted models. From Fig. 7, it can also
be seen that retrained models improve the word error rate by 6% for 15-dB SNR and even
by 18% for 0-dB SNR.

Further experiments were carried out on the JNAS (Japanese Newspaper Article
Sentence) database provided by ASJ (Acoustic Society of Japan). JNAS contains 306 speakers
(153 males and 153 females) reading excerpts from the Mainichi Newspaper (100 sentences)
and the ATR 503 PB Sentences (50 phonetically balanced sentences). As in the experiments
described above, white Gaussian noise was added to the testing utterances at different SNR
levels. In this experiment, 35 context-independent monophone HMMs were deployed with
three states per model and nominal 16 Gaussian mixture densities per state.
Japanese phone recognition was performed with the constraint of syllable topology. The
further experiments, as illustrated in Fig. [...], overcome the
retrained performances, which are usually viewed as the upper bounds, at all SNR levels.
Horizontally, the SNR is shown in dB. Vertically the respective error rates are shown (in
percentages).

CLAIMS:



1. A method of processing a degraded speech input signal; the method including:
- receiving the degraded speech input signal;
- estimating a condition, such as the signal-to-noise ratio or bandwidth, of the
received input signal;
- selecting a processing model corresponding to the estimated signal condition;
- estimating an originally uttered speech signal based on the received input
signal;
- processing the estimated original signal according to the selected model; and
- outputting a processing result.

2. The method as claimed in claim 1, wherein the step of estimating the originally
uttered speech signal includes determining a most likely uttered speech signal given a
predetermined processing model.

3. The method as claimed in claim 2, wherein the predetermined processing model
is a processing model selected as corresponding to the estimated signal condition.

4. The method as claimed in claim 3, wherein the method includes iteratively:
- performing a new estimate of the signal condition of the received input signal;
- selecting a processing model corresponding to the newly estimated signal
condition;
- estimating an originally uttered speech signal based on the estimated original
signal of an immediately preceding iteration given the selected processing model;
- processing the estimated original signal according to the selected model;
and terminating the iteration when a predetermined condition is met.

5. The method as claimed in claim 4, wherein the iteration is terminated if a
processing result no longer improves.

6. The method as claimed in claim 4, wherein performing a new estimate of the
signal condition includes selecting a more degraded signal condition.

7. A method as claimed in claim 1, wherein the speech processing involves
recognizing speech and the processing model is a speech recognition model.

8. A method as claimed in claim 1, wherein the speech processing involves coding
speech and the processing model is a speech codebook/encoder.

9. A speech processing system for processing a degraded speech input signal; the
system including:
- an input for receiving the degraded speech input signal;
- means for estimating a condition, such as the signal-to-noise ratio or
bandwidth, of the received input signal;
- means for selecting a processing model corresponding to the estimated signal
condition;
- means for estimating an originally uttered speech signal based on the received
input signal;
- means for processing the estimated original signal according to the selected
model; and
- an output for outputting a processing result.

ABSTRACT:






A speech processing system, such as a speech recognition or speech coding
system, is capable of processing a degraded speech input signal. The system includes an input
for receiving the degraded speech input signal. Means 420 are used for estimating a condition,
such as the signal-to-noise ratio or bandwidth, of the received input signal. Means 430 are
used for selecting a processing model which corresponds to the estimated signal
condition. The model may be retrieved from a storage 440 with models for different signal
conditions. Means 430 are also operable to estimate an originally uttered speech signal based
on the received input signal and to process the estimated original signal according to the
selected model.